DNN Transfer Learning based Non-linear Feature Extraction for Acoustic Event Classification
Recent acoustic event classification research has focused on training
suitable filters to represent acoustic events. However, due to limited
availability of target event databases and linearity of conventional filters,
there is still room for improving performance. By exploiting the non-linear
modeling of deep neural networks (DNNs) and their ability to learn beyond
pre-trained environments, this letter proposes a DNN-based feature extraction
scheme for the classification of acoustic events. The effectiveness and
robustness to noise of the proposed method are demonstrated using a database of
indoor surveillance environments.
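The core idea, using the non-linear hidden activations of a pre-trained DNN as acoustic-event features, can be sketched as follows. This is an illustrative toy, not the paper's implementation: the weights below are random stand-ins for a network that would in practice be pre-trained on a large source-domain corpus, and the 40-dimensional input, layer sizes, and ReLU activation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "pre-trained" weights: 40-dim acoustic input -> 64 -> 32 hidden units.
# In the transfer-learning setting these would come from a network trained
# on a large, mismatched source corpus.
W1, b1 = rng.standard_normal((40, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.standard_normal((64, 32)) * 0.1, np.zeros(32)

def extract_features(x):
    """Forward pass through the hidden layers; the non-linear (ReLU)
    activations of the last hidden layer serve as the event features."""
    h1 = np.maximum(0.0, x @ W1 + b1)
    h2 = np.maximum(0.0, h1 @ W2 + b2)
    return h2

# A batch of 10 frames of 40-dim acoustic features (e.g., log-mel energies).
frames = rng.standard_normal((10, 40))
feats = extract_features(frames)
print(feats.shape)  # (10, 32)
```

The extracted features would then be fed to a back-end classifier trained on the (small) target event database, which is where the transfer happens.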
Into-TTS : Intonation Template based Prosody Control System
Intonation plays an important role in conveying the speaker's intention.
However, current end-to-end TTS systems often fail to model proper
intonations. To alleviate this problem, we propose a novel, intuitive method to
synthesize speech in different intonations using predefined intonation
templates. Prior to the acoustic model training, speech data are automatically
grouped into intonation templates by k-means clustering, according to their
sentence-final F0 contour. Two proposed modules are added to the end-to-end TTS
framework: intonation classifier and intonation encoder. The intonation
classifier recommends a suitable intonation template to the given text. The
intonation encoder, attached to the text encoder output, synthesizes speech
abiding by the requested intonation template. The main contributions of our paper are:
(a) an easy-to-use intonation control system covering a wide range of users;
(b) better performance in wrapping speech in a requested intonation, with
improved pitch distance and MOS; and (c) feasibility of future integration
between TTS and NLP, with TTS able to utilize contextual information. Audio
samples are available at https://srtts.github.io/IntoTTS.
Comment: Submitted to INTERSPEECH 202
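The template-building step described above, grouping sentence-final F0 contours by k-means, can be sketched in a few lines. This is not the paper's code: the contour length (20 frames), the number of templates (k = 3), and the synthetic falling/rising/flat contours are all assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=50):
    """Minimal k-means: returns (centroids, labels)."""
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each contour to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned contours.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

t = np.linspace(0, 1, 20)
# Synthetic sentence-final F0 contours (Hz): falling, rising, and flat shapes.
contours = np.concatenate([
    200 - 60 * t + rng.normal(0, 3, (30, 20)),  # falling (declarative-like)
    180 + 70 * t + rng.normal(0, 3, (30, 20)),  # rising (interrogative-like)
    190 + 0 * t + rng.normal(0, 3, (30, 20)),   # flat
])
templates, labels = kmeans(contours, k=3)
print(templates.shape)  # (3, 20)
```

Each resulting centroid acts as one intonation template; at training time every utterance carries the label of its nearest template, which the intonation classifier and encoder then learn to predict and realize.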
An Empirical Study on L2 Accents of Cross-lingual Text-to-Speech Systems via Vowel Space
With the recent developments in cross-lingual Text-to-Speech (TTS) systems,
L2 (second-language, or foreign) accent problems arise. Moreover, running a
subjective evaluation for such cross-lingual TTS systems is troublesome. The
vowel space analysis, which is often utilized to explore various aspects of
language including L2 accents, is a great alternative analysis tool. In this
study, we apply the vowel space analysis method to explore L2 accents of
cross-lingual TTS systems. Through the vowel space analysis, we observe the
following three phenomena: a) a parallel architecture (Glow-TTS) is less L2-accented
than an auto-regressive one (Tacotron); b) L2 accents are more dominant in
non-shared vowels in a language pair; and c) L2 accents of cross-lingual TTS
systems share some phenomena with those of human L2 learners. Our findings
imply that it is necessary for TTS systems to handle each language pair
differently, depending on their linguistic characteristics such as non-shared
vowels. They also hint that we can further incorporate linguistic knowledge in
developing cross-lingual TTS systems.
Comment: Submitted to ICASSP 202
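One common way to quantify vowel-space differences of the kind analyzed above is to compare the area of the polygon spanned by corner-vowel formants (F1, F2). The sketch below is illustrative only: the formant values are rough textbook figures for /i/, /a/, /u/, not measurements from the paper, and "shrinkage" is one simple summary statistic among many.

```python
import numpy as np

def polygon_area(points):
    """Shoelace formula: area of a polygon given ordered (x, y) vertices."""
    x, y = points[:, 0], points[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Corner vowels (F1, F2) in Hz, ordered around the polygon: /i/, /a/, /u/.
# Values are rough illustrative figures, not data from the study.
native = np.array([[270, 2290], [730, 1090], [300, 870]], float)
accented = np.array([[350, 2000], [650, 1150], [380, 950]], float)

# A centralized (L2-accented) vowel space typically covers a smaller area.
shrinkage = 1 - polygon_area(accented) / polygon_area(native)
print(f"vowel space shrinkage: {shrinkage:.1%}")
```

Comparing such areas (or the positions of individual vowels) across a TTS system's native and cross-lingual output gives an objective proxy for the L2-accentedness that would otherwise require a subjective listening test.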